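Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process hinders their deployment for text-to-speech (TTS). Through a preliminary study of diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, which avoids distinct quality degradation when sampling is accelerated. To tackle the model convergence challenge that comes with fewer diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as the training target and distills that behavior into a new model with N/2 steps. This allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only two iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff samples 24 times faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can easily be extended to the multi-speaker setting. Audio samples are available at https://prodiff.github.io/.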
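The step-halving distillation described in the ProDiff abstract above can be pictured as a schedule of teacher/student pairs. The following is a minimal sketch only: the abstract states a single N-to-N/2 distillation, and repeating that halving until a target step count is reached is an assumption made here for illustration. The function name `halving_schedule` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def halving_schedule(teacher_steps: int, target_steps: int) -> List[Tuple[int, int]]:
    """Return (teacher_steps, student_steps) pairs for successive distillation rounds.

    Hypothetical helper: each round distills an N-step teacher into an
    N/2-step student, and the student becomes the next round's teacher.
    """
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need 1 <= target_steps <= teacher_steps")
    rounds = []
    n = teacher_steps
    while n > target_steps:
        # Halving must be exact and must not overshoot the target.
        if n % 2 != 0 or n // 2 < target_steps:
            raise ValueError("teacher_steps must equal target_steps times a power of two")
        rounds.append((n, n // 2))
        n //= 2
    return rounds


print(halving_schedule(8, 2))  # [(8, 4), (4, 2)]
```

Under this assumption, reaching the two-iteration sampler mentioned in the abstract from an 8-step teacher would take two distillation rounds.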
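Mean-field games (MFGs) are a modeling framework for systems with a large number of interacting agents. They have applications in economics, finance, and game theory. Normalizing flows (NFs) are a family of deep generative models that compute data likelihoods by using an invertible mapping, which is typically parameterized with neural networks. They are useful for density modeling and data generation. Although both models have been actively studied, few works have noted the relationship between the two. In this work, we unravel the connection between MFGs and NFs by framing the training of an NF as solving an MFG. This is achieved by reformulating the MFG problem in terms of agent trajectories and parameterizing a discretization of the resulting MFG with flow architectures. With this connection, we explore two research directions. First, we employ expressive NF architectures to accurately solve high-dimensional MFGs, sidestepping the curse of dimensionality in traditional numerical methods. Compared with other deep learning approaches, our trajectory-based formulation encodes the continuity equation in the neural network, resulting in a better approximation of the population dynamics. Second, we regularize the training of NFs with transport costs and show that this is effective in controlling the model's Lipschitz bound, resulting in better generalization performance. We demonstrate numerical results through comprehensive experiments on a variety of synthetic and real-life datasets.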
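To make the statement that normalizing flows compute data likelihoods through an invertible mapping concrete, here is a minimal sketch of the change-of-variables rule for a one-dimensional affine map. It illustrates the general principle only, not the flow architectures or the MFG formulation of the paper; the function names and the choice of a standard normal base density are assumptions made for this example.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log-density of the standard normal base distribution."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x: float, shift: float, scale: float) -> float:
    """Log-likelihood of x under the flow z = (x - shift) / scale.

    Change of variables: log p(x) = log p_base(z) + log |dz/dx|.
    """
    if scale <= 0:
        raise ValueError("scale must be positive so the map is invertible")
    z = (x - shift) / scale        # invertible map from data to base space
    log_det = -math.log(scale)     # log |dz/dx| = -log(scale)
    return standard_normal_logpdf(z) + log_det


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Reference: closed-form log-density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


# The flow likelihood agrees with the closed-form Gaussian density.
print(affine_flow_logpdf(1.3, 0.5, 2.0), gaussian_logpdf(1.3, 0.5, 2.0))
```

In practice the affine map is replaced by a stack of learned invertible layers, and the log-determinant terms of the layers are summed.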
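Auto-encoding is a popular approach to representation learning. Conventional auto-encoders employ symmetric encoding-decoding procedures and a simple Euclidean latent space to detect hidden low-dimensional structures in an unsupervised way. This work introduces a chart auto-encoder with an asymmetric encoding-decoding process that can incorporate additional semi-supervised information such as class labels. Besides enhancing the ability to handle data with complicated topological and geometric structures, these models can successfully differentiate nearby but disjoint manifolds, as well as intersecting manifolds, with only a small amount of supervision. Moreover, the model requires only a low-complexity encoder, such as a local linear projection. We discuss the theoretical approximation power of such networks, which essentially depends on the intrinsic dimension of the data manifold rather than on the dimension of the observations. Our numerical experiments on synthetic and real-world data verify that the proposed model can effectively handle data with nearby but disjoint manifolds of different classes, overlapping manifolds, and manifolds with non-trivial topology.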
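Ship re-identification technology is an important component of intelligent transportation systems and of the visual perception tasks required for maritime surveillance. However, unlike the situation on land, the maritime environment is complex and variable and offers fewer samples, which makes ship re-identification at sea more difficult. This paper therefore proposes a transfer dynamic alignment algorithm and simulates the swaying of vessels at sea, using similar warships as test targets to raise the recognition difficulty and thereby cope with the effects of complex sea conditions. It also discusses the effect of using different types of vessels as transfer objects. The experimental results show that the improved algorithm raises the mean average precision (mAP) by 10.2% and the first hit rate (Rank1) by 4.9% on average.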
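Nowadays, DNNs are ubiquitous on edge devices. As their importance and number of use cases grow, it is unlikely that all DNNs can be packed into device memory with every inference already warmed up. Cold inference, the process of reading, initializing, and executing a DNN model, is therefore becoming commonplace, and optimizing its performance is urgently needed. To this end, we present NNV12, the first on-device inference engine optimized for cold inference. NNV12 is built on three novel optimization knobs: selecting an appropriate kernel (implementation) for each DNN operator, bypassing the weight transformation process by caching the post-transformed weights on disk, and pipelining the execution of many kernels on asymmetric processors. To tackle the huge search space, NNV12 employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan. We fully implement a prototype of NNV12 and evaluate its performance in extensive experiments. The results show that NNV12 achieves speedups of up to 15.2x and 401.5x over state-of-the-art DNN engines on edge CPUs and GPUs, respectively.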
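Depth estimation is solved as either a regression or a classification problem in existing learning-based multi-view stereo methods. Although both representations have recently demonstrated excellent performance, they still have apparent shortcomings: regression methods tend to overfit because the cost volume is learned indirectly, and classification methods cannot directly infer exact depth because their predictions are discrete. In this paper, we propose a novel representation, termed Unification, that unifies the advantages of regression and classification. It can directly constrain the cost volume as classification methods do, while also achieving sub-pixel depth prediction as regression methods do. To exploit the potential of Unification, we design a new loss function named Unified Focal Loss, which is more uniform and reasonable for combating the challenge of sample imbalance. Combining these two modules, which add no extra burden, we present a coarse-to-fine framework that we call UniMVSNet. Ranking first on both the DTU and Tanks and Temples benchmarks verifies that our model not only performs best but also has the best generalization ability.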
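The contrast the abstract above draws between classification and regression can be illustrated with the way each one reads a depth out of per-hypothesis probabilities. This sketch assumes the common formulation in which classification takes the most probable depth hypothesis and regression takes the probability-weighted average; it shows the two existing representations only, not the proposed Unification representation, and the function names are hypothetical.

```python
from typing import Sequence


def classification_depth(probs: Sequence[float], depths: Sequence[float]) -> float:
    """Pick the depth hypothesis with the highest probability (discrete output)."""
    if len(probs) != len(depths) or not probs:
        raise ValueError("probs and depths must be non-empty and equally long")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return depths[best]


def regression_depth(probs: Sequence[float], depths: Sequence[float]) -> float:
    """Probability-weighted average of the hypotheses (continuous output)."""
    if len(probs) != len(depths) or not probs:
        raise ValueError("probs and depths must be non-empty and equally long")
    return sum(p * d for p, d in zip(probs, depths))


# One pixel, three depth hypotheses.
probs = [0.1, 0.6, 0.3]
depths = [1.0, 2.0, 3.0]
print(classification_depth(probs, depths))  # the most probable hypothesis
print(regression_depth(probs, depths))      # a value between hypotheses
```

The classification readout can only return one of the sampled hypotheses, whereas the weighted average can land between them, which is the sub-pixel behaviour the abstract attributes to regression.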
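Scene graph generation (SGG) remains a challenging visual understanding task because of its complex compositional nature. Most previous works adopt either a bottom-up two-stage approach or a point-based one-stage approach, which often suffer from high time complexity or suboptimal design assumptions. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates entity and predicate proposal sets and then infers directed edges to form relation triplets. In particular, we develop a new entity-aware predicate representation based on a structural predicate generator to exploit the compositional property of relationships. Moreover, we design a graph assembling module that infers the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on two challenging benchmarks, surpassing most existing approaches while offering higher inference efficiency. We hope our model can serve as a strong baseline for transformer-based scene graph generation.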
This paper focuses on designing efficient models with few parameters and low FLOPs for dense prediction. Although CNN-based lightweight methods have achieved impressive results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of a Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods; e.g., our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1, surpassing SoTA CNN-/Transformer-based models while trading off model accuracy and efficiency well.
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
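The triple-to-question step described above can be sketched as follows. This is a hedged illustration under simple assumptions, not the PIE-QG implementation: the `Triple` type, the `triple_to_qa` function, and the fixed question templates are all hypothetical, and the abstract does not specify how question wording is actually produced.

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    """A <subject, predicate, object> triple as produced by an OpenIE system."""
    subject: str
    predicate: str
    object: str


def triple_to_qa(triple: Triple, answer_slot: str = "object",
                 wh_word: str = "what") -> Tuple[str, str]:
    """Build a (question, answer) pair by hiding one slot of the triple.

    Hypothetical templates: the hidden slot is replaced by a wh-word and
    its original text becomes the answer.
    """
    if answer_slot == "object":
        return f"{triple.subject} {triple.predicate} {wh_word}?", triple.object
    if answer_slot == "subject":
        return f"{wh_word} {triple.predicate} {triple.object}?", triple.subject
    raise ValueError("answer_slot must be 'subject' or 'object'")


triple = Triple("Marie Curie", "discovered", "polonium")
print(triple_to_qa(triple))
print(triple_to_qa(triple, answer_slot="subject", wh_word="Who"))
```

Either slot can serve as the answer, which mirrors the abstract's note that questions are formed from subjects (or objects) and predicates while the remaining element is treated as the answer.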
The Transformer has achieved impressive success in various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such datasets are usually unavailable for medical images. Additionally, because of the gap between medical and natural images, the improvement brought by ImageNet-pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach designed specifically for medical image classification with a Transformer backbone. BOLT consists of two networks, namely the online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To make the most of the Transformer with limited medical data, we propose an auxiliary difficulty ranking task, in which the Transformer is required to identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill transformation-invariant features from the perturbed tokens so as to simultaneously measure difficulty and maintain the consistency of the self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification over ImageNet-pretrained weights and state-of-the-art self-supervised learning approaches.